S2FT: Efficient, Scalable and Generalizable LLM Fine-tuning by Structured Sparsity

Yang, X; Leng, J; Guo, G; Zhao, J; Nakada, R; Zhang, L; Yao, H; Chen, B (December 2024, NeurIPS)

Current PEFT methods for LLMs can achieve either high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Utilizing this key insight, we propose a family of Structured Sparse Fine-Tuning (S2FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S2FT accomplishes this by "selecting sparsely and computing densely". It selects a few heads and channels in the MHA and FFN modules for each Transformer block, respectively. Next, it co-permutes weight matrices on both sides of the coupled structures in LLMs to connect the selected components in each layer into a dense submatrix. Finally, S2FT performs in-place gradient updates on all submatrices. Through theoretical analysis and empirical results, we show that our method prevents overfitting and forgetting, delivers SOTA performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements compared to LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S2FT reduces training memory by up to 3× and improves latency by 1.5-2.7× compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S2FT can be decoupled into adapters, enabling effective fusion, fast switching, and efficient parallelism for serving multiple fine-tuned models.
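The "select sparsely, compute densely" step for an FFN block can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy two-layer ReLU FFN, treats the down projection as the trained matrix, and uses a placeholder gradient of ones. The names `co_permute` and `update_dense_block` are hypothetical. The sketch checks two things: reordering the hidden channels on both sides of the coupled pair leaves the FFN output unchanged, and the selected channels then form one contiguous block that can be updated in place while the rest stays frozen.

```python
import random

def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(W_up, W_down, x):
    # Toy FFN (assumption): y = W_down @ relu(W_up @ x).
    h = [max(0.0, v) for v in matvec(W_up, x)]
    return matvec(W_down, h)

def co_permute(W_up, W_down, selected):
    # Hypothetical helper: move the selected hidden channels to the front.
    # Rows of W_up and columns of W_down are reordered with the SAME
    # permutation, so the function computed by the FFN is preserved.
    chosen = set(selected)
    rest = [c for c in range(len(W_up)) if c not in chosen]
    perm = list(selected) + rest
    W_up_p = [W_up[c] for c in perm]
    W_down_p = [[row[c] for c in perm] for row in W_down]
    return W_up_p, W_down_p

def update_dense_block(W_down, grad_block, k, lr):
    # Hypothetical helper: in-place SGD step on the first k columns only.
    for row, g_row in zip(W_down, grad_block):
        for j in range(k):
            row[j] -= lr * g_row[j]

rng = random.Random(0)
d_model, d_ff = 4, 8
W_up = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
W_down = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
x = [rng.uniform(-1, 1) for _ in range(d_model)]

selected = [5, 2]          # sparsely chosen channels (arbitrary for the demo)
k = len(selected)
W_up_p, W_down_p = co_permute(W_up, W_down, selected)

# The permuted FFN matches the original up to floating-point rounding.
y_before = ffn(W_up, W_down, x)
y_after = ffn(W_up_p, W_down_p, x)
max_diff = max(abs(a - b) for a, b in zip(y_before, y_after))

# Update only the dense d_model x k block; the other columns stay frozen.
frozen_before = [row[k:] for row in W_down_p]
grad_block = [[1.0] * k for _ in range(d_model)]   # placeholder gradient
update_dense_block(W_down_p, grad_block, k, lr=0.1)
frozen_unchanged = frozen_before == [row[k:] for row in W_down_p]
```

Because the trainable entries sit in one contiguous block after the permutation, the update is an ordinary dense operation on a small slice rather than a scattered sparse write.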
